Missing data, measurement error, and the future of probability

Johannes Karl (https://johannes-karl.com) and Joseph Bulbulia (https://josephbulbulia.netlify.app), Victoria University of Wellington (https://www.wgtn.ac.nz)
2021-MAY-25

Multiple-Imputation as a way of handling Missing Data (JB)

Reading for Multiple-Imputation:

Honaker, King, and Blackwell (2011)

Bhaskaran and Smeeth (2014)

Blackwell, Honaker, and King (2017)

McElreath (2020)

https://gking.harvard.edu/category/research-interests/methods/missing-data

Missing data pose a problem: missingness biases inference. However, missingness is not an intractable problem if we can assume that the mechanism giving rise to the missing data is random conditional on known features of the dataset. Statisticians call this assumption “MAR: Missing At Random.”1
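To make the MAR assumption concrete, here is a minimal simulation sketch (toy data, not the NZ dataset): whether `y` is missing depends only on an observed covariate `x`, never on the unobserved value of `y` itself, so conditioning on `x` can recover unbiased inference.

```r
# A minimal MAR sketch with simulated data
set.seed(123)
n <- 1000
x <- rnorm(n)
y <- 2 * x + rnorm(n)
p_miss <- plogis(-1 + 1.5 * x)            # missingness depends only on observed x
y_obs <- ifelse(runif(n) < p_miss, NA, y) # y is MAR given x
mean(is.na(y_obs))                        # share of y that is missing
```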

Let us visualise missingness in a subset of the longitudinal NZ dataset.

First, how many Ids per wave are in this dataset?

df <- nz12 %>%
  select(
    Id,
    CharityDonate,
    Emp.JobSecure,
    Male,
    Employed,
    Relid,
    Wave,
    yearS,
    KESSLER6sum,
    Age
  )

# always inspect your dataframe
glimpse(df)
Rows: 4,140
Columns: 10
$ Id            <fct> 15, 15, 15, 15, 15, 15, 15, 15, 15, …
$ CharityDonate <dbl> 20, 0, 5, 10, 70, 0, 170, 160, 80, 1…
$ Emp.JobSecure <dbl> 4, 6, 6, NA, 7, 5, NA, 7, NA, 7, NA,…
$ Male          <fct> Male, Male, Male, Male, Male, Male, …
$ Employed      <dbl> 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, …
$ Relid         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, …
$ Wave          <fct> 2010, 2011, 2012, 2013, 2014, 2015, …
$ yearS         <dbl> 27, 347, 834, 1200, 1608, 2037, 2336…
$ KESSLER6sum   <int> 4, 4, 4, 4, 3, 4, 5, 4, 3, 5, 1, 2, …
$ Age           <dbl> 38.82820, 39.70431, 41.03765, 42.036…
nz12 %>%
  group_by(Wave) %>%
  summarise(Unique_Id = n_distinct(Id))
# A tibble: 10 x 2
   Wave  Unique_Id
   <fct>     <int>
 1 2010        414
 2 2011        414
 3 2012        414
 4 2013        414
 5 2014        414
 6 2015        414
 7 2016        414
 8 2017        414
 9 2018        414
10 2019        414

That’s not many participants, but the data will be useful for exploring the multiple-imputation approach.

We can visualise the data, using the naniar package
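For example (a sketch, assuming `df` from above and that the naniar package is installed):

```r
library(naniar)
# heatmap of observed versus missing cells across the dataframe
naniar::vis_miss(df)
# count of missing values per variable, faceted by wave
naniar::gg_miss_var(df, facet = Wave)
```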

We can see substantial missingness for Emp.JobSecure.

Let’s explore this:

df %>%
  select(Wave, Emp.JobSecure) %>%
  group_by(Wave) %>%
  tally(is.na(Emp.JobSecure))
# A tibble: 10 x 2
   Wave      n
   <fct> <int>
 1 2010     89
 2 2011    106
 3 2012    118
 4 2013    117
 5 2014    129
 6 2015    133
 7 2016    414
 8 2017    167
 9 2018    172
10 2019    183

Lots of missingness in Emp.JobSecure, and the question was not included in 2016.

table1::table1(~ Wave|Emp.JobSecure, data = df, overall = FALSE)
                          Emp.JobSecure
         1           2           3           4           5           6           7
      (N=90)      (N=96)     (N=139)     (N=268)     (N=403)     (N=731)     (N=785)
Wave
2010  12 (13.3%)  12 (12.5%)  18 (12.9%)  31 (11.6%)  68 (16.9%)  90 (12.3%)  94 (12.0%)
2011  16 (17.8%)  14 (14.6%)  15 (10.8%)  45 (16.8%)  43 (10.7%)  84 (11.5%)  91 (11.6%)
2012  11 (12.2%)  10 (10.4%)  17 (12.2%)  41 (15.3%)  43 (10.7%)  86 (11.8%)  88 (11.2%)
2013  11 (12.2%)  11 (11.5%)  17 (12.2%)  35 (13.1%)  44 (10.9%)  90 (12.3%)  89 (11.3%)
2014  10 (11.1%)  10 (10.4%)  16 (11.5%)  28 (10.4%)  40 (9.9%)   91 (12.4%)  90 (11.5%)
2015   5 (5.6%)   10 (10.4%)  18 (12.9%)  27 (10.1%)  46 (11.4%)  78 (10.7%)  97 (12.4%)
2016   0 (0%)      0 (0%)      0 (0%)      0 (0%)      0 (0%)      0 (0%)      0 (0%)
2017   7 (7.8%)    9 (9.4%)   12 (8.6%)   21 (7.8%)   46 (11.4%)  71 (9.7%)   81 (10.3%)
2018   8 (8.9%)   12 (12.5%)  13 (9.4%)   23 (8.6%)   35 (8.7%)   73 (10.0%)  78 (9.9%)
2019  10 (11.1%)   8 (8.3%)   13 (9.4%)   17 (6.3%)   38 (9.4%)   68 (9.3%)   77 (9.8%)

There are various methods for multiple imputation. First, let’s look at the Amelia package

library(Amelia)
# set seed
set.seed(1234)

# we need to pass a dataframe to Amelia
prep <- as.data.frame(df) # tibble won't run in amelia !!


# this is the key code
prep2 <- Amelia::amelia(
  prep,                   # dataset to impute
  m = 10,                 # number of imputations
  cs = c("Id"),           # the cross-sectional variable
  ts = c("yearS"),        # time series, allowing polynomials
  # ords = ,              # none in this dataset, but use this argument for ordinal data
  # logs = ,              # big positive numbers are better imputed on the natural-log scale
  sqrt = c("KESSLER6sum", "CharityDonate"), # skewed positive data such as K6
  noms = c("Male", "Employed"),             # nominal vars
  idvars = c("Wave"),     # identifiers: kept in the data but not used to impute
  polytime = 3
) # https://stackoverflow.com/questions/56218702/missing-data-warning-r
-- Imputation 1 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62

-- Imputation 2 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28

-- Imputation 3 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24

-- Imputation 4 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23

-- Imputation 5 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25

-- Imputation 6 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70

-- Imputation 7 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43

-- Imputation 8 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20


-- Imputation 9 --

  1  2  3  4  5

-- Imputation 10 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30

We can use Amelia’s time-series plots to examine what was imputed, and with how much uncertainty:

Here are the imputations for Job Security (a random selection)

Amelia::tscsPlot(
  prep2,
  cs = c("15", "19", "20", "39", "549", "1078"),
  main = "Imputation of Job security",
  var = "Emp.JobSecure",
  ylim = c(0, 30)
)

Here are the imputations for Charity (another random selection). Note that there is a fair amount of uncertainty here:

Amelia::tscsPlot(
  prep2,
  cs = c("394", "1039", "1082", "340", "365", "1149", "1238" , "1253","1229"),
  main = "Imputation of Charity",
  var = "CharityDonate",
  ylim = c(0, 10000)
)

We can center and scale our variables using the following code:

prep3 <- Amelia::transform.amelia(
  prep2,
  Id = as.factor(Id),
  # redundant
  Age.10yrs = (Age / 10),
  years_s = scale(yearS, center = TRUE, scale = TRUE),
  years = yearS,
  KESSLER6sum_S = scale(KESSLER6sum, center = TRUE, scale =TRUE),
  Employed = factor(Employed),
  Relid = scale(Relid, scale = TRUE, center = TRUE),
  Male = as.factor(Male),
  Emp.JobSecure_S = scale(Emp.JobSecure, center = TRUE, scale = FALSE),
  CharityDonate = as.integer(CharityDonate)
)

# center age (in decades); scale = FALSE leaves the units intact
out <- Amelia::transform.amelia(prep3, Age_in_Decades_C = scale(Age.10yrs, scale = FALSE, center = TRUE))

Which variables do we need to include to estimate the causal effect of job security on charity?

Write your dag!

library(ggdag)
dg <-
  dagify(
    charity ~ jobsecure + employed + age + male + relid + years,
    jobsecure ~ employed + distress + male + years + age,
    distress ~ male  + employed + years + age,
    relid ~ male + years,
    age ~ years,
    labels = c(
      "charity" = "charity",
      "jobsecure" = "job security",
      "employed" = "employed",
      "age" = "age",
      "male" = "male",
      "relid" = "religious identity",
      "years" = "years"
    ),
    exposure = "jobsecure",
    outcome = "charity"
  ) %>%
  tidy_dagitty(layout = "nicely")

ggdag::ggdag_adjustment_set(dg)

To obtain an unbiased estimate of the effect of job security on charity, we must condition on employed, male, age, and years.

We can fit the model using the lme4 package, which is fast. I wrote a little function; recall that we have 10 imputed datasets.

# first write out the model equation
library(lme4)
mod_eq <-  'CharityDonate  ~  Emp.JobSecure_S  + Employed + Age_in_Decades_C +  Male  +  years_s +  (1|Id)' 

# run models iterating over imputed data
loop_glmer_model <-
  function(x, y) {
    # x is the mod equation, y is the data
    m <- 10
    mod <- NULL
    for (i in 1:m) {
      mod[[i]] <- glmer(x, data = y$imputations[[i]], family = "poisson")
    }
    return(mod)
  }

m_list <- loop_glmer_model(mod_eq, out)

Here is a function for obtaining the results:

# table of pooled effects
loop_lmer_model_tab <- function(x) {
  mp <- lapply(x, parameters::model_parameters)
  out <- parameters::pool_parameters(mp)
  return(out)
}

# create table
tab_impute <- loop_lmer_model_tab(m_list)
tab_impute
# Fixed Effects

Parameter         | Coefficient |   SE |         95% CI | Statistic |      p
----------------------------------------------------------------------------
(Intercept)       |        6.27 | 0.50 | [ 5.28,  7.26] |     12.42 | < .001
Emp.JobSecure_S   |        0.03 | 0.03 | [-0.03,  0.08] |      0.95 | 0.342 
Employed [1]      |        0.16 | 0.03 | [ 0.09,  0.23] |      4.56 | < .001
Age_in_Decades_C  |       -4.41 | 1.27 | [-6.90, -1.92] |     -3.47 | < .001
Male [Not_Male]   |       -1.33 | 0.69 | [-2.69,  0.02] |     -1.92 | 0.054 
years_s           |        1.26 | 0.37 | [ 0.53,  1.98] |      3.41 | < .001
SD (Intercept)    |        6.31 |      |                |           |       
SD (Observations) |        1.00 |      |                |           |       

# create graph
plot_impute <- plot(tab_impute)
plot_impute

We can plot the predicted effect of job security using the ggeffects package:

library(ggeffects)
library(gghighlight) # not used here, but useful for interactions
graph_predictions_imputed <- function(x, y) {
  # x = list of model objects, y = plot title
  m <- 10
  out <- NULL
  for (i in 1:m) {
    out[[i]] <-
      ggeffects::ggpredict(x[[i]], terms = c("Emp.JobSecure_S"))
  }
  plots <- NULL
  for (i in 1:m) {
    plots[[i]] <-
      plot(out[[i]], facets = TRUE) # + scale_y_continuous(limits = c(6.35, 6.85))
  }
  # display the prediction from the last imputed dataset
  plots[[10]] +
    gghighlight::gghighlight() +
    ggtitle(y)
}

# graph
graph_predictions_imputed(m_list, "Effect of Job Security on Charity (not reliable)")

If you want a LaTeX table, you can use this code:

library(huxtable)

huxtable::as_hux( your_model_here ) %>%
  select("Parameter", "Coefficient", "CI_low", "CI_high", "p") %>%
  set_number_format(3) %>%
  set_left_padding(20) %>%
  set_bold(1, everywhere) %>%
  quick_latex()

Compare imputation results with row-wise deleted results

When you run a regression on data with missing values, R automatically deletes the incomplete rows (row-wise deletion).

Let’s look at the results from the row-wise deleted data:

# prepare data as we did for the imputed dataset
df2 <- df %>%
  dplyr::mutate(
    Age.10yrs = (Age / 10),
    Age_in_Decades_C = scale(Age.10yrs, scale = FALSE, center = TRUE),
    years_s = scale(yearS, center = TRUE, scale = TRUE),
    years = yearS,
    KESSLER6sum_S = scale(KESSLER6sum, center = TRUE, scale = TRUE),
    Employed = factor(Employed),
    Relid = scale(Relid, scale = TRUE, center = TRUE),
    Male = as.factor(Male),
    Emp.JobSecure_S = scale(Emp.JobSecure, center = TRUE, scale = FALSE)
  )

# run model
m_no_impute <- glmer(mod_eq, data = df2, family = "poisson")

# create table
tab_no <-
  parameters::model_parameters(m_no_impute, effects = c("all"))
tab_no
# Fixed Effects

Parameter        |  Log-Mean |       SE |         95% CI |      z |      p
--------------------------------------------------------------------------
(Intercept)      |      5.26 |     0.08 | [ 5.09,  5.42] |  62.00 | < .001
Emp.JobSecure_S  | -5.65e-03 | 5.87e-04 | [-0.01,  0.00] |  -9.61 | < .001
Employed [1]     |      0.19 | 6.46e-03 | [ 0.18,  0.20] |  29.18 | < .001
Age_in_Decades_C |     -0.89 |     0.04 | [-0.96, -0.81] | -21.86 | < .001
Male [Not_Male]  |     -0.64 |     0.11 | [-0.86, -0.43] |  -5.79 | < .001
years_s          |      0.18 |     0.01 | [ 0.16,  0.21] |  15.41 | < .001

# Random Effects

Parameter          | Coefficient
--------------------------------
SD (Intercept: Id) |        1.00
SD (Residual)      |        1.00

# create graph
plot_no <- plot(tab_no)
plot_no

When we compare the graphs, we see that the multiply imputed datasets shrink estimates towards zero.

Multiple imputation is sometimes avoided because people don’t like to “invent” data. However, creating multiply imputed datasets and integrating over their uncertainty during modelling tends to increase the uncertainty in a model. That’s generally a good thing when we want to predict features of the population.
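The mechanics behind this are Rubin’s rules: the pooled variance is the average within-imputation variance plus an inflated between-imputation variance, so pooled standard errors are never smaller than those of a single completed-data analysis. A minimal sketch (the coefficients and standard errors below are made up for illustration):

```r
# Rubin's rules: pool m point estimates q and their squared SEs u
pool_rubin <- function(q, u) {
  m <- length(q)
  qbar <- mean(q)              # pooled point estimate
  ubar <- mean(u)              # average within-imputation variance
  b <- var(q)                  # between-imputation variance
  t <- ubar + (1 + 1 / m) * b  # total variance
  c(estimate = qbar, se = sqrt(t))
}

# hypothetical coefficients from 10 imputed datasets
q <- c(0.03, 0.02, 0.04, 0.03, 0.05, 0.02, 0.03, 0.04, 0.03, 0.02)
u <- rep(0.03^2, 10)
pool_rubin(q, u)
```

The pooled standard error exceeds the within-imputation standard error whenever the imputations disagree.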

library(patchwork)

plot_impute / plot_no +
  plot_annotation(title = "Comparison of regressions using (a) multiple-imputed  and (b) row-wise deleted datasets",
                  tag_levels = 'a')

However, it would be a mistake to think that multiple imputation is always sufficient. Note that the row-wise deleted model is confident that men give less to charity than do women and other genders, whereas the imputed model pulls the estimate for men towards zero. Perhaps the imputation inflates the estimates for men because men disproportionately leave this question unanswered. Let’s check this intuition:

df %>%
  dplyr::mutate(CharityNa = is.na(CharityDonate)) %>%
  count(Male, CharityNa) %>%
  group_by(Male) %>%
  mutate(freq = n / sum(n))
# A tibble: 4 x 4
# Groups:   Male [2]
  Male     CharityNa     n   freq
  <fct>    <lgl>     <int>  <dbl>
1 Male     FALSE      1506 0.972 
2 Male     TRUE         44 0.0284
3 Not_Male FALSE      2462 0.951 
4 Not_Male TRUE        128 0.0494

No: the intuition was wrong. However, it is important to realise that missingness might not be at random conditional on the variables in the model. In that case, your dataset cannot avoid its biases through multiple imputation.
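To see why, contrast MAR with MNAR (missing not at random) in a small simulation sketch (toy data): here missingness depends on the unobserved value of `y` itself, so no imputation from observed covariates can remove the bias.

```r
set.seed(42)
n <- 1000
y <- rnorm(n, mean = 5)
# MNAR: larger values of y are more likely to go missing, and nothing
# we observe predicts this mechanism
y_mnar <- ifelse(runif(n) < plogis(y - 5), NA, y)
mean(y)                     # the truth is close to 5
mean(y_mnar, na.rm = TRUE)  # the observed mean is biased downward
```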

Recall that including more variables is generally useful for accurate prediction. For the same reason, it is generally useful to include more variables when multiply imputing data.

Appendix 1: Imputation using BRMS

Multiple imputation in BRMS (warning: these models take 10x longer)

# prepare data
# you must create a list of the imputed datasets, like so:

brmdat <- list(
  out$imputations$imp1,
  out$imputations$imp2,
  out$imputations$imp3,
  out$imputations$imp4,
  out$imputations$imp5,
  out$imputations$imp6,
  out$imputations$imp7,
  out$imputations$imp8,
  out$imputations$imp9,
  out$imputations$imp10
)

We write the model as follows:

# model equation
# note the `bf` command
library(brms)
bf_mod_eq <- bf(mod_eq)

# we use the `brm_multiple` syntax and feed in the brmdat

fit_imp1 <-
  brm_multiple(
    bf_mod_eq,
    data = brmdat,
    family = "poisson",
    file = here::here("models", "bayes-imp-1")
  )

# table
summary(fit_imp1)
 Family: poisson 
  Links: mu = log 
Formula: CharityDonate ~ Emp.JobSecure_S + Employed + Age_in_Decades_C + Male + years_s + (1 | Id) 
   Data: brmdat (Number of observations: 4140) 
Samples: 40 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 40000

Group-Level Effects: 
~Id (Number of levels: 414) 
              Estimate Est.Error l-95% CI u-95% CI Rhat
sd(Intercept)     5.40      2.89     1.08     9.93 7.38
              Bulk_ESS Tail_ESS
sd(Intercept)       41       47

Population-Level Effects: 
                 Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept            7.33      1.83     4.51    11.56 8.47
Emp.JobSecure_S      0.03      0.02    -0.01     0.06 5.11
Employed1            0.16      0.03     0.11     0.20 4.43
Age_in_Decades_C    -3.14      2.04    -5.65     0.13 5.31
MaleNot_Male        -0.55      0.87    -2.81     0.56 7.08
years_s              0.89      0.59    -0.06     1.62 5.36
                 Bulk_ESS Tail_ESS
Intercept              41       40
Emp.JobSecure_S        42       86
Employed1              43       68
Age_in_Decades_C       42       49
MaleNot_Male           41       48
years_s                42       53

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
plot(fit_imp1)  # VERY POOR MIXING

Mixing is a problem. Indeed, I don’t think I have ever run a model that has mixed worse than this one.

We see the trouble in the coefficient plots:

#plot posteriors
plot_bayes_1 <- brms::mcmc_plot(fit_imp1,
                                type = "areas",
                                prob = .89)

plot_bayes_1 +
  plot_impute +
  plot_annotation(title = "Comparison of regressions using (a) Bayesian imputation and (b) lmer models",
                  tag_levels = 'a')

We can model missingness in BRMS in one step (for continuous missing variables). Let’s try that next:

# Emp.JobSecure_S  + Employed + Age_in_Decades_C +  Male  +  years_s

# Note that BRMS can only impute continuous data. That's not a problem here:
# our factors are converted to integers anyway.
df3 <- df2 %>%
  dplyr::mutate(Employed = as.numeric(as.character(Employed)),
                Male = as.numeric(Male) - 1) # recode the factor to 0/1

glimpse(df3)
Rows: 4,140
Columns: 16
$ Id               <fct> 15, 15, 15, 15, 15, 15, 15, 15, 1…
$ CharityDonate    <dbl> 20, 0, 5, 10, 70, 0, 170, 160, 80…
$ Emp.JobSecure    <dbl> 4, 6, 6, NA, 7, 5, NA, 7, NA, 7, …
$ Male             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
$ Employed         <dbl> 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, …
$ Relid            <dbl[,1]> <matrix[20 x 1]>
$ Wave             <fct> 2010, 2011, 2012, 2013, 2014,…
$ yearS            <dbl> 27, 347, 834, 1200, 1608, 2037, 2…
$ KESSLER6sum      <int> 4, 4, 4, 4, 3, 4, 5, 4, 3, 5, 1, …
$ Age              <dbl> 38.82820, 39.70431, 41.03765, 42.…
$ Age.10yrs        <dbl> 3.882820, 3.970431, 4.103765, 4.2…
$ Age_in_Decades_C <dbl[,1]> <matrix[20 x 1]>
$ years_s          <dbl[,1]> <matrix[20 x 1]>
$ years            <dbl> 27, 347, 834, 1200, 1608, 2037, 2…
$ KESSLER6sum_S    <dbl[,1]> <matrix[20 x 1]>
$ Emp.JobSecure_S  <dbl[,1]> <matrix[20 x 1]>


df3%>%
  dplyr::count(CharityDonate ==0)
# A tibble: 3 x 2
  `CharityDonate == 0`     n
  <lgl>                <int>
1 FALSE                 3453
2 TRUE                   515
3 NA                     172

## Write the model; note the `mi()` terms: each variable with missing values must appear as the outcome of its own sub-model.
## For simplicity, these sub-models assume missingness is random conditional on year and individual. We might obtain better predictions of missingness by including more information.

# Note that brms can only take identity link functions when handling missing data. The "lognormal" family is better than the normal here because the variances are estimated using a log link.

bform <- 
  bf(CharityDonate + 1 | mi() ~ mi(Emp.JobSecure_S) + mi(Employed) + mi(Age_in_Decades_C)  + mi(Male)  + (1 + years_s | Id), family = "lognormal") +
  bf(Emp.JobSecure_S | mi() ~ mi(Employed) + mi(Age_in_Decades_C)  + mi(Male)  + (1 + years_s| Id)) +
  bf(Employed | mi() ~ (1 + years_s| Id)) +
  bf(Age_in_Decades_C | mi() ~  (1 + years_s| Id))+
  bf(Male | mi() ~ (1 + years_s | Id)) +
  set_rescor(FALSE)

## fit the model 
fit_imp2 <- brm(bform, 
                data = df3, 
                file = here::here("models", "bayes-imp-2"))
summary(fit_imp2)
 Family: MV(lognormal, gaussian, gaussian, gaussian, gaussian) 
  Links: mu = identity; sigma = identity
         mu = identity; sigma = identity
         mu = identity; sigma = identity
         mu = identity; sigma = identity
         mu = identity; sigma = identity 
Formula: CharityDonate + 1 | mi() ~ mi(Emp.JobSecure_S) + mi(Employed) + mi(Age_in_Decades_C) + mi(Male) + (1 + years_s | Id) 
         Emp.JobSecure_S | mi() ~ mi(Employed) + mi(Age_in_Decades_C) + mi(Male) + (1 + years_s | Id) 
         Employed | mi() ~ (1 + years_s | Id) 
         Age_in_Decades_C | mi() ~ (1 + years_s | Id) 
         Male | mi() ~ (1 + years_s | Id) 
   Data: df3 (Number of observations: 4140) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~Id (Number of levels: 414) 
                                                     Estimate
sd(CharityDonate1_Intercept)                             1.25
sd(CharityDonate1_years_s)                               0.16
sd(EmpJobSecureS_Intercept)                              0.35
sd(EmpJobSecureS_years_s)                                0.17
sd(Employed_Intercept)                                   0.13
sd(Employed_years_s)                                     0.04
sd(AgeinDecadesC_Intercept)                              0.55
sd(AgeinDecadesC_years_s)                                0.07
sd(Male_Intercept)                                       0.36
sd(Male_years_s)                                         0.00
cor(CharityDonate1_Intercept,CharityDonate1_years_s)    -0.36
cor(EmpJobSecureS_Intercept,EmpJobSecureS_years_s)       0.25
cor(Employed_Intercept,Employed_years_s)                 0.15
cor(AgeinDecadesC_Intercept,AgeinDecadesC_years_s)      -0.30
cor(Male_Intercept,Male_years_s)                         0.26
                                                     Est.Error
sd(CharityDonate1_Intercept)                              0.70
sd(CharityDonate1_years_s)                                0.19
sd(EmpJobSecureS_Intercept)                               0.46
sd(EmpJobSecureS_years_s)                                 0.18
sd(Employed_Intercept)                                    0.17
sd(Employed_years_s)                                      0.06
sd(AgeinDecadesC_Intercept)                               0.32
sd(AgeinDecadesC_years_s)                                 0.09
sd(Male_Intercept)                                        0.10
sd(Male_years_s)                                          0.00
cor(CharityDonate1_Intercept,CharityDonate1_years_s)      0.56
cor(EmpJobSecureS_Intercept,EmpJobSecureS_years_s)        0.55
cor(Employed_Intercept,Employed_years_s)                  0.31
cor(AgeinDecadesC_Intercept,AgeinDecadesC_years_s)        0.26
cor(Male_Intercept,Male_years_s)                          0.31
                                                     l-95% CI
sd(CharityDonate1_Intercept)                             0.07
sd(CharityDonate1_years_s)                               0.00
sd(EmpJobSecureS_Intercept)                              0.00
sd(EmpJobSecureS_years_s)                                0.01
sd(Employed_Intercept)                                   0.00
sd(Employed_years_s)                                     0.00
sd(AgeinDecadesC_Intercept)                              0.04
sd(AgeinDecadesC_years_s)                                0.00
sd(Male_Intercept)                                       0.20
sd(Male_years_s)                                         0.00
cor(CharityDonate1_Intercept,CharityDonate1_years_s)    -0.97
cor(EmpJobSecureS_Intercept,EmpJobSecureS_years_s)      -0.46
cor(Employed_Intercept,Employed_years_s)                -0.35
cor(AgeinDecadesC_Intercept,AgeinDecadesC_years_s)      -0.72
cor(Male_Intercept,Male_years_s)                        -0.22
                                                     u-95% CI
sd(CharityDonate1_Intercept)                             1.89
sd(CharityDonate1_years_s)                               0.47
sd(EmpJobSecureS_Intercept)                              1.15
sd(EmpJobSecureS_years_s)                                0.45
sd(Employed_Intercept)                                   0.41
sd(Employed_years_s)                                     0.13
sd(AgeinDecadesC_Intercept)                              0.89
sd(AgeinDecadesC_years_s)                                0.23
sd(Male_Intercept)                                       0.45
sd(Male_years_s)                                         0.00
cor(CharityDonate1_Intercept,CharityDonate1_years_s)     0.33
cor(EmpJobSecureS_Intercept,EmpJobSecureS_years_s)       0.97
cor(Employed_Intercept,Employed_years_s)                 0.47
cor(AgeinDecadesC_Intercept,AgeinDecadesC_years_s)       0.02
cor(Male_Intercept,Male_years_s)                         0.65
                                                     Rhat
sd(CharityDonate1_Intercept)                         3.31
sd(CharityDonate1_years_s)                           3.47
sd(EmpJobSecureS_Intercept)                          3.78
sd(EmpJobSecureS_years_s)                            3.10
sd(Employed_Intercept)                               3.01
sd(Employed_years_s)                                 3.48
sd(AgeinDecadesC_Intercept)                          3.31
sd(AgeinDecadesC_years_s)                            2.94
sd(Male_Intercept)                                   3.57
sd(Male_years_s)                                     4.27
cor(CharityDonate1_Intercept,CharityDonate1_years_s) 4.06
cor(EmpJobSecureS_Intercept,EmpJobSecureS_years_s)   3.44
cor(Employed_Intercept,Employed_years_s)             3.12
cor(AgeinDecadesC_Intercept,AgeinDecadesC_years_s)   3.27
cor(Male_Intercept,Male_years_s)                     3.42
                                                     Bulk_ESS
sd(CharityDonate1_Intercept)                                4
sd(CharityDonate1_years_s)                                  4
sd(EmpJobSecureS_Intercept)                                 4
sd(EmpJobSecureS_years_s)                                   5
sd(Employed_Intercept)                                      5
sd(Employed_years_s)                                        4
sd(AgeinDecadesC_Intercept)                                 4
sd(AgeinDecadesC_years_s)                                   5
sd(Male_Intercept)                                          4
sd(Male_years_s)                                            4
cor(CharityDonate1_Intercept,CharityDonate1_years_s)        4
cor(EmpJobSecureS_Intercept,EmpJobSecureS_years_s)          4
cor(Employed_Intercept,Employed_years_s)                    5
cor(AgeinDecadesC_Intercept,AgeinDecadesC_years_s)          4
cor(Male_Intercept,Male_years_s)                            4
                                                     Tail_ESS
sd(CharityDonate1_Intercept)                               11
sd(CharityDonate1_years_s)                                 16
sd(EmpJobSecureS_Intercept)                                11
sd(EmpJobSecureS_years_s)                                  18
sd(Employed_Intercept)                                     20
sd(Employed_years_s)                                       19
sd(AgeinDecadesC_Intercept)                                15
sd(AgeinDecadesC_years_s)                                  32
sd(Male_Intercept)                                         12
sd(Male_years_s)                                           11
cor(CharityDonate1_Intercept,CharityDonate1_years_s)       14
cor(EmpJobSecureS_Intercept,EmpJobSecureS_years_s)         13
cor(Employed_Intercept,Employed_years_s)                   15
cor(AgeinDecadesC_Intercept,AgeinDecadesC_years_s)         23
cor(Male_Intercept,Male_years_s)                           15

Population-Level Effects: 
                                  Estimate Est.Error
CharityDonate1_Intercept              4.46      0.03
EmpJobSecureS_Intercept               0.11      0.12
Employed_Intercept                    0.70      0.01
AgeinDecadesC_Intercept              -0.02      0.05
Male_Intercept                        0.63      0.01
CharityDonate1_miEmp.JobSecure_S      0.01      0.02
CharityDonate1_miEmployed             0.35      0.10
CharityDonate1_miAge_in_Decades_C     0.15      0.06
CharityDonate1_miMale                -0.15      0.13
EmpJobSecureS_miEmployed              0.08      0.07
EmpJobSecureS_miAge_in_Decades_C      0.09      0.05
EmpJobSecureS_miMale                 -0.25      0.13
                                  l-95% CI u-95% CI Rhat
CharityDonate1_Intercept              4.43     4.50 3.22
EmpJobSecureS_Intercept              -0.01     0.29 3.30
Employed_Intercept                    0.69     0.72 3.49
AgeinDecadesC_Intercept              -0.08     0.04 3.30
Male_Intercept                        0.62     0.65 3.29
CharityDonate1_miEmp.JobSecure_S     -0.02     0.04 3.90
CharityDonate1_miEmployed             0.21     0.48 3.34
CharityDonate1_miAge_in_Decades_C     0.06     0.21 3.01
CharityDonate1_miMale                -0.30     0.05 3.43
EmpJobSecureS_miEmployed             -0.03     0.17 3.98
EmpJobSecureS_miAge_in_Decades_C      0.04     0.17 3.54
EmpJobSecureS_miMale                 -0.45    -0.10 3.34
                                  Bulk_ESS Tail_ESS
CharityDonate1_Intercept                 4       15
EmpJobSecureS_Intercept                  4       19
Employed_Intercept                       4       16
AgeinDecadesC_Intercept                  4       12
Male_Intercept                           4       14
CharityDonate1_miEmp.JobSecure_S         4       12
CharityDonate1_miEmployed                4       11
CharityDonate1_miAge_in_Decades_C        5       22
CharityDonate1_miMale                    4       15
EmpJobSecureS_miEmployed                 4       12
EmpJobSecureS_miAge_in_Decades_C         4       15
EmpJobSecureS_miMale                     4       13

Family Specific Parameters: 
                     Estimate Est.Error l-95% CI u-95% CI
sigma_CharityDonate1     1.69      0.45     1.34     2.47
sigma_EmpJobSecureS      1.43      0.20     1.09     1.63
sigma_Employed           0.39      0.10     0.21     0.46
sigma_AgeinDecadesC      0.74      0.60     0.01     1.39
sigma_Male               0.00      0.00     0.00     0.00
                     Rhat Bulk_ESS Tail_ESS
sigma_CharityDonate1 3.45        4       16
sigma_EmpJobSecureS  3.12        5       11
sigma_Employed       3.64        4       11
sigma_AgeinDecadesC  3.09        5       20
sigma_Male           3.22        4       11

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Again, the model is all over the place:

plot(fit_imp2)

Mixing was a problem. Minimally, we’d need to include more variables. However, there might be deeper problems with the data, or with our approach. Analysis is iterative: we need to return to the data and scrutinise it more carefully.

plot_bayes_2 <- brms::mcmc_plot(fit_imp2,
                                type = "areas",
                                prob = .89)

plot_bayes_2

# again we run into trouble. 
plot(fit_imp2)

Appendix 2. Probability

We have been working with probability throughout this course.

Suppose there is a test that is 99% accurate at detecting COVID if you have it.

Very rarely it throws up a false positive: say, one in a hundred.

You just tested positive. What is the probability that you have COVID? Our intuition says we probably do. However, let’s assume COVID is rare. Currently in NZ there are about 50 cases, roughly 1 in 100,000. The background rate matters.

Bayes rule says

\[ Pr(COVID|Positive) = \frac{Pr(Positive|COVID)\times Pr(COVID)}{Pr(Positive)} \]

We plug in the numbers:

Pr_Positive_COVID <- 0.99
Pr_Positive_Healthy <- 0.01
Pr_COVID <- 0.00001

# Calculate the background probability of testing positive

Pr_Positive <- Pr_Positive_COVID * Pr_COVID + Pr_Positive_Healthy * ( 1 - Pr_COVID )

# Now calculate your probability of having COVID given a positive test

Pr_COVID_Positive <- Pr_Positive_COVID * Pr_COVID / (Pr_Positive )
Pr_COVID_Positive # 1 in 1000
[1] 0.0009890307
  1. For a wonderful explanation, see: here↩︎

References

Bhaskaran, Krishnan, and Liam Smeeth. 2014. “What Is the Difference Between Missing Completely at Random and Missing at Random?” International Journal of Epidemiology 43 (4): 1336–39. https://doi.org/10.1093/ije/dyu080.

Blackwell, Matthew, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods and Research 46 (3): 303–41. http://journals.sagepub.com/doi/full/10.1177/0049124115585360.

Honaker, James, Gary King, and Matthew Blackwell. 2011. “Amelia II: A Program for Missing Data.” Journal of Statistical Software 45 (7): 1–47. http://www.jstatsoft.org/v45/i07/.

McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.